Syntactic analyses and named entity recognition for PubMed and PubMed Central - up-to-the-minute
Authors
Abstract
Although advanced text mining methods specifically adapted to the biomedical domain are continuously being developed, they have rarely been applied on a large scale. One of the main reasons is the lack of the computational resources and workforce required to process large text corpora. In this paper we present a publicly available resource distributing preprocessed biomedical literature, including sentence splitting, tokenization, part-of-speech tagging, syntactic parses and named entity recognition. The aim of this work is to support the future development of large-scale text mining resources by eliminating the time-consuming but necessary preprocessing steps. The resource covers the whole of PubMed and the PubMed Central Open Access section, currently containing 26M abstracts and 1.4M full articles, constituting over 388M analyzed sentences. The resource is based on a fully automated pipeline, guaranteeing that the distributed data is always up-to-date. The resource is available at https://turkunlp.github.io/pubmed_parses/.
Similar resources
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...
Tagger: BeCalm API for rapid named entity recognition
Most BioCreative tasks to date have focused on assessing the quality of text-mining annotations in terms of precision and recall. Interoperability, speed, and stability are, however, other important factors to consider for practical applications of text mining. The new BioCreative/BeCalm TIPS task focuses purely on these. To participate in this task, I implemented a BeCalm API within the real-ti...
Named Entity Recognition in Persian Text using Deep Learning
Named entity recognition is a fundamental task in the field of natural language processing. It is also regarded as a subtask of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...
BANNER-CHEMDNER: Incorporating Domain Knowledge in Chemical and Drug Named Entity Recognition
Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining, due to the recent growth in the amount of biomedical literature. Named entity recognition is an essential prerequisite task before effective text mining of biomedical literature can begin. The participants of the CHEMDNER task of the BioCreative IV challenge are as...
One Tagger, Many Uses: Illustrating the Power of Ontologies in Dictionary-based Named Entity Recognition
Automatic annotation of text is an important complement to manual annotation, because the latter is highly labour intensive. We have developed a fast dictionary-based named entity recognition (NER) system and addressed a wide variety of biomedical problems by applying it to text from many different sources. We have used this tagger both in real-time tools to support curation efforts and in pipel...